May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture, Volume 25 issue 2
The performance tradeoff between hardware complexity and clock speed is studied. First, a
generic superscalar pipeline is defined. Then the specific areas of register renaming, 
instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is 
modeled and Spice simulated for feature sizes of 0.8µm, 0.35µm, and 
0.18µm. Performance results and trends are expressed in terms of issue width and 
window size. Our analysis indicates that window wakeu ... 
Dynamic superscalar processors execute multiple instructions out-of-order by looking for
independent operations within a large window. The number of physical registers within the 
processor has a direct impact on the size of this window as most in-flight instructions 
require a new physical register at dispatch. A large multi-ported register file helps improve 
the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, 
especially in future wire-limited technologies. ... 

Dynamically managing the communication-parallelism trade-off in future clustered
processors
May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2
Clustered microarchitectures are an attractive alternative to large monolithic superscalar
designs due to their potential for higher clock rates in the face of increasingly wire-delay- 
constrained process technologies. As increasing transistor counts allow an increase in the 
number of clusters, thereby allowing more aggressive use of instruction-level parallelism 
(ILP), the inter-cluster communication increases as data values get spread across a wider 
area. As a result of the emergence of this tr ...
Previous proposals for implementing instruction-level temporal redundancy in out-of-order
cores have reported a performance degradation of up to 45% in certain applications
compared to an execution which does not have any temporal redundancy. An
important contributor to this problem is the insufficient number of ALUs for handling the
amplified load injected into the core. At the same time, increasing the number of ALUs can
increase the complexity of the issue logic, which has been pointed out to be one o ...
August 2004 Proceedings of the 2004 international symposium on Low power
As clock frequency and die area increase, achieving energy efficiency, while distributing a
low skew, global clock signal becomes increasingly difficult. Challenges imposed by deep- 
submicron technologies can be alleviated by using a multiple voltage/multiple frequency 
island design style, or otherwise called, globally asynchronous, locally synchronous (GALS) 
design paradigm. This paper proposes a clustered architecture that enables application- 
adaptive energy efficiency through the use of dynami ... 

Keywords: clustered architectures, dynamic voltage scaling 
The multicluster architecture that we introduce offers a decentralized, dynamically-
scheduled architecture, in which the register files, dispatch queue, and functional units of
the architecture are distributed across multiple clusters, and each cluster is assigned a
subset of the architectural registers. The motivation for the multicluster architecture is to
reduce the clock cycle time, relative to a single-cluster architecture with the same number 
of hardware resources, by reducing the size and ... 
Keywords: decentralized architecture, partitioned architecture, static instruction
scheduling, register allocation 
One of the main concerns in today's processor design is the issue logic. Instruction-level
parallelism is usually favored by an out-of-order issue mechanism where instructions can 
issue independently of the program order. The out-of-order scheme yields the best 
performance but at the same time introduces important hardware costs such as an 
associative look-up, which might be prohibitive for wide issue processors with large 
instruction windows. This associative search may slow-down t ... 

Keywords: in-order issue, instruction issue logic, out-of-order issue, wide-issue superscalar 
annual international symposium on Computer architecture, volume 31 issue 2
Despite their superior performance for multimedia applications, vector processors have
three limitations that hinder their widespread acceptance. First, the complexity and size of 
the centralized vector register file limits the number of functional units. Second, precise 
exceptions for vector instructions are difficult to implement. Third, vector processors require 
an expensive on-chip memory system that supports high bandwidth at low access 
latency. This paper introduces CODE, a scalable vector ...
Traces are dynamic instruction sequences constructed and cached by hardware. A
microarchitecture organized around traces is presented as a means for efficiently executing 
many instructions per cycle. Trace processors exploit both control flow and data flow 
hierarchy to overcome complexity and architectural limitations of conventional superscalar 
processors by (1) distributing execution resources based on trace boundaries and (2) 
applying control and data prediction at the trace level rather than ... 

Keywords: trace processors, multiscalar processors, trace cache, next trace prediction, 
selective reissuing, context-based value prediction 
Clustered VLIW architectures have been widely adopted in modern embedded multimedia
applications for their ability to exploit high degrees of ILP with reasonable trade-off in 
complexity and silicon costs. Studies have however shown limited performance scaling for 
wide-issue machines. In this paper we describe the architecture of a clustered VLIW with a 
runtime reconfigurable inter-cluster bus suitable to address such scalability problem. The 
architecture is aimed at kernel loops acceleration thr ... 

Keywords: IDCT, clustered VLIW, modulo scheduling, reconfigurable co-processor (RCP) 
The issue logic of dynamically scheduled superscalar processors is one of their most complex
and power-consuming parts. In this paper we present alternative issue-logic designs that 
are much simpler than the traditional scheme while they retain most of its ability to exploit 
ILP. These alternative schemes are based on the observation that most values produced by 
a program are used by very few instructions, and the latencies of most operations are
deterministic. 
In recent years reducing power has become a critical design goal for high-performance
microprocessors. This work attempts to bring the power issue to the earliest phase of high- 
performance microprocessor development. We propose a methodology for power- 
optimization at the micro-architectural level. First, major targets for power reduction are 
identified within superscalar microarchitecture, then an optimization of a superscalar micro- 
architecture is performed that generates a set of ... 
In some of today's superscalar processors (e.g. the Pentium III), the result repositories are
implemented as the Reorder Buffer (ROB) slots. In such designs, the ROB is a complex 
multi-ported structure that occupies a significant portion of the die area and dissipates a 
non-trivial fraction of the total chip power, as much as 27% according to some estimates. In 
addition, an access to such ROB typically takes more than one cycle, impacting the IPC 
adversely. We propose a low-complexity and low-powe ... 

Keywords: low-complexity datapath, low-power design, reorder buffer 
An instruction set architecture (ISA) suitable for future microprocessor design constraints is
proposed. The ISA has hierarchical register files with a small number of accumulators at the 
top. The instruction stream is divided into chains of dependent instructions (strands) where 
intra-strand dependences are passed through the accumulator. The general-purpose register 
file is used for communication between strands and for holding global values that have 
many consumers. A microarchitecture to supp ... 
Multiported register files are a critical component of high-performance superscalar
microprocessors. Conventional multiported structures can consume significant power and die 
area. We examine the designs of banked multiported register files that employ multiple 
interleaved banks of fewer ported register cells to reduce power and area. Banked register 
file designs have been shown to provide sufficient bandwidth for a superscalar machine,
but previous designs had complex control structures that w ... 
This work examines dynamic cluster assignment for a clustered trace cache processor
(CTCP). Previously proposed cluster assignment techniques run into unique problems as 
issue width and cluster count increase. Realistic design conditions, such as variable data 
forwarding latencies between clusters and a heavily partitioned instruction window, increase 
the degree of difficulty for effective cluster assignment. In this work, the trace cache and fill 
unit are used to perform dynamic cluster assignme ... 
The growing dominance of wire delays at future technology points renders a microprocessor
communication-bound. Clustered microarchitectures allow most dependence chains to 
execute without being affected by long on-chip wire latencies. They also allow faster clock 
speeds and reduce design complexity, thereby emerging as a popular design choice for 
future microprocessors. However, a centralized data cache threatens to be the primary
bottleneck in highly clustered systems. The paper attempts to id ...

Keywords: clustered microarchitectures, communication-bound processors, data prefetch, 
distributed caches, effective address and memory dependence prediction, processor 